DdD 3 © Dl 
GENETICS 



ORIGINAL RESEARCH ARTICLE 

published: 16 December 2013 
doi: 10.3389/fgene. 2013.00252 




On multi-marker tests for association in case-control 
studies 

Margaret A. Taub 1 , Holger R. Schwender 2 , Samuel G. Younkin \ Thomas A. Louis 1 and 
Ingo Ruczinski 1 * 

' Department of Biostatistics, Johns Hopkins University, Baltimore, MD, USA 

2 Mathematical Institute, Heinrich Heine University Dusseldorf, Dusseldorf, Germany 



Edited by: 

Xuefeng Wang, Harvard University, 
USA 

Reviewed by: 

Karen Conneely Emory University, 
USA 

Jun Chen, Harvard School of Public 
Health, USA 

'Correspondence: 

Ingo Ruczinski, Department of 
Biostatistics, Bloomberg School of 
Public Health, Johns Hopkins 
University, 615 N Wolfe Street 
Baltimore, MD 21205-2179, USA 
e-mail: ingo@jhu.edu 



Genome-wide association studies (GWAs) have identified thousands of DNA loci 
associated with a variety of traits. Statistical inference is almost always based 
on single marker hypothesis tests of association and the respective p-values with 
Bonferroni correction. Since commercially available genomic arrays interrogate hundreds 
of thousands or even millions of loci simultaneously, many causal yet undetected loci are 
believed to exist because the conditional power to achieve a genome-wide significance 
level can be low, in particular for markers with small effect sizes and low minor allele 
frequencies and in studies with modest sample size. However, the correlation between 
neighboring markers in the human genome due to linkage disequilibrium (LD) resulting 
in correlated marker test statistics can be incorporated into multi-marker hypothesis 
tests, thereby increasing power to detect association. Herein, we establish a theoretical 
benchmark by quantifying the maximum power achievable for multi-marker tests of 
association in case-control studies, achievable only when the causal marker is known. 
Using that genotype correlations within an LD block translate into an asymptotically 
multivariate normal distribution for score test statistics, we develop a set of weights for 
the markers that maximize the non-centrality parameter, and assess the relative loss of 
power for other approaches. We find that the method of Conneely and Boehnke (2007) 
based on the maximum absolute test statistic observed in an LD block is a practical and 
powerful method in a variety of settings. We also explore the effect on the power that prior 
biological or functional knowledge used to narrow down the locus of the causal marker can 
have, and conclude that this prior knowledge has to be very strong and specific for the 
power to approach the maximum achievable level, or even beat the power observed for 
methods such as the one proposed by Conneely and Boehnke (2007). 

Keywords: genome-wide association studies, linkage disequilibrium, multi-marker tests, multiplicity adjustment, 
single nucleotide polymorphisms 



INTRODUCTION 

Genome-wide association studies (GWAs) are a prominent 
approach to search for single-nucleotide polymorphisms (SNPs) 
associated with disease or other phenotypes. To date, results from 
more than 1000 GWAs have been reported, identifying over ten 
thousand DNA loci to be statistically associated with one or more 
of hundreds of phenotypes investigated (http://www.genome. 
gov/gwastudies). Typically test statistics and p-values are reported 
for each marker on the genomic array, and genome-wide sig- 
nificance for a SNP is declared if the p-value after Bonferroni 
correction is below a threshold for a desired family-wise error 
rate. Commercially available genomic arrays interrogate the geno- 
types of individuals at hundreds of thousands or even millions 
of loci, and p-values less than 5 x 10~ 8 are usually required to 
achieve genome-wide significance. Obviously, these levels of sig- 
nificance are difficult to reach unless the signal is very strong or 
the sample size is very large. However, the correlation between 
neighboring markers in the human genome due to linkage 



disequilibrium (LD) can be incorporated into statistical tests, and 
thereby increase the power to detect association under the same 
family-wise error rate. 

Reducing test- multiplicity by taking advantage of the observed 
marker correlation in LD blocks has been a very active field of 
research. Haplotype-based methods can be an attractive option to 
decrease the testing burden (Schaid et al., 2002; Chapman et al., 
2003), especially in settings where genetic diversity between sub- 
jects is low and/or markers are densely typed. However, most 
approaches avoid the phasing step for haplotype estimation and 
use the observed genotypes and/or the respective marginal test 
statistics and p-values instead to generate a single test statistic 
and p-value for the entire LD block. The approaches most sim- 
ilar to the traditionally employed Bonferroni method are those 
that estimate the "effective number of tests" based on the correla- 
tion structure and use those instead of the actual number of tests 
to control for the family-wise error rate (Nyholt, 2004; Li and Ji, 
2005; Moskvina and Schmidt, 2008). Fisher's inverse chi-square 
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test statistic (Fisher, 1932) is another choice to quantify departure 
from randomness in a set of multiple p-values. However, for cor- 
related data such as thep-values stemming from the markers in an 
LD block the inference has to be based on a proper null distribu- 
tion generated either by permutations (Chapman and Whittaker, 
2008) or adjustments to the degrees of freedom in the / 2 distri- 
bution (Makambi, 2003; Chai et al, 2009). Other methods based 
on the observed genotypes include some traditional multivariate 
procedures such as Hotelling's T 2 -test (Xiong et al., 2002), prin- 
cipal components analysis (Home and Camp, 2004; Gauderman 
et al., 2007) and Fourier transformations (Wang and Elston, 
2007), but also concepts borrowed from the statistical learn- 
ing and regularization literature, such as kernel methods (Schaid 
et al, 2005; Kwee et al, 2008; Mukhopadhyay et al, 2010; Wu 
et al, 2010; Pan, 2011), penalized regression (Basu et al, 2011), 
and the LASSO (Shi et al., 2007; Wu et al, 2009). Further, biosta- 
tistical concepts employed include latent variables (Wang et al., 
2009a), empirical Bayes methods (Goeman et al, 2006), likeli- 
hood ratio tests that simultaneously compare genotype means 
and variances across cases and controls (Wang et al., 2009b), and 
even hybrids that combine several of those approaches (Pan et al., 
2010). While the above methods are based on the observed data 
only, other approaches also include additional information such 
as publicly available data bases (Li et al., 2009) or gene sets and 
ontologies (Wang et al, 2007; Chasman, 2008; Holden et al, 2008; 
O'Dushlaine et al., 2009). The results of some comparisons of 
multi-marker tests in case-control studies to detect association 
with SNP sets have been reported, for example by Chapman and 
Whittaker (2008) and Ballard et al. (2010). 

Despite the advances made in methods development for multi- 
marker tests and the additional power that can be gained, the 
standard approach to analyze GWAs data still is to carry out single 
marker tests with Bonferroni correction. The somewhat limited 
use of the novel statistical methods is arguably due in part to the 
fact that some of these methods can be computationally demand- 
ing or that open source software is not always available. However, 
there are powerful multi-marker tests that are very easy to imple- 
ment and scale, including the approaches proposed by Seaman 
and Muller-Myhsok (2005) and Conneely and Boehnke (2007). 
Both methods are based on marginal score tests for each SNP, and 
the authors demonstrate how genotype correlations within an LD 
block translate into an asymptotically multivariate normal distri- 
bution for the test statistics, with a variance-covariance derived 
from the estimates of LD. As an alternative to computationally 
intensive permutation tests, Seaman and Muller-Myhsok (2005) 
propose to sample from this multivariate distribution to calcu- 
late the statistical significance of an observed test statistic, while 
Conneely and Boehnke (2007) propose to directly use the mul- 
tivariate cumulative distribution function to calculate ^-values, 
particularly for the multi-marker test based on the maximum 
of the absolute values of the observed marginal test statistics. 
Computational procedures to assess multivariate normal cumu- 
lative distribution functions are readily available, for example as 
implemented in the statistical software environment R (Genz and 
Bretz, 2009; Genz et al., 201 1). Both of these approaches are com- 
pletely data driven and do not require prior biological knowledge 
or external reference data. 



In what follows, we quantify the maximum power achievable 
for multi-marker tests to detect association in case-control stud- 
ies, which relies on the hypothetical assumption that the locus of 
the causal marker in an LD block is known. Similar to the deriva- 
tions in Seaman and Muller-Myhsok (2005) and Conneely and 
Boehnke (2007) we show that genotype correlations within an LD 
block translate into an asymptotically multivariate normal distri- 
bution for the score test statistics, and develop a set of weights for 
the markers that maximize the non-centrality parameter in the 
overall test statistic. We assess the relative loss of power of some 
alternative, data driven, and thus practical methods without such 
prior knowledge. We also use some simulations to explore the 
effect on the power that prior knowledge used to narrow down the 
locus of the causal marker has, and how much of the maximum 
achievable power it can reach. 

METHODS 

SCORE TEST STATISTICS AND CORRELATION STRUCTURES 

In a case-control setting we assume the retrospective risk relation 
to be 

zi x = Pr(G=l\x) = F(\i G + Q G x), (1) 

where x e {0, 1} is the fixed binary disease status indicator, and G 
is a function of the genotype that specifies the genetic model. In 
the following we assume that G e {0, 1}, for example encoding a 
dominant model for a bi-allelic marker (the more general coding 
is considered in the supplementary materials), but for simplicity 
still refer to G as the genotype. In this setting G is the random 
variable with E(G) = tt, the relative frequency of G = 1 in the 
study population. As usual, the parameters |Xg and 0g describe 
the relationship between x and tc via the link F, with 6g being the 
parameter of interest. We denote the disease status indicator for 

individual i £ {1 n} by X;, and the genotype for individual i 

by g{. Thus, ^ x,- is equal to the number of cases and n — ^Xi is 
equal to the number of controls. 

Henceforth, we assume that F is inverse-logit. To test the 
hypothesis of no genotype/phenotype association at a specific 
marker Ho : 9g = 0 (or equivalently, Ho : 7to = tti = t) we use 
the score test statistic 

7 !&-*)(&•-*) Tg , 7 . 

Zg = — / _ = „ = (2) 
y/nx(l — x)it(l — 7l) D G 

where x. = - ^ Xj and ft = g (introduced for example in Agresti, 
2012). In a study with an equal number of cases and controls we 
have 3c = 1/2 and thus, the above simplifies to 

lVn(fti-fto) ... 
Z G = z — 7= „ . i 3 ) 
2 yftfl - ft) 

where fti and tcq are the sample means for g in the cases and 
controls, respectively. Under the null hypothesis 9g = 0, the ran- 
dom variable Zq has mean 0 and variance 1 and its distribution is 
approximately normal for sufficiently large n. 

In addition to G, consider a second marker H and let ^ x = 
Pr(H = 1 \x). As in Equation (2) above, the relevant score statistic 
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Er =1 (*,--*)(fe,--i) _ Th 

jnx(l-x)%(l-%) ° H 



(4) 



with % = h. Setting E(H) = ^, the conditional distribution of H 
given G is p^ g = Pr(H = h\G = g), providing a measure of the 
LD between G and H. Consequently % = pi|o + rc(pi|i — pi|o) 
and 



r(G,H) = (p 1|1 -p 1|0 )(^ r - 



1/2 



(5) 



In the following, we derive the relation between the correlation of 
the test statistics Z G and Zh and the correlation between G and 
H under the null hypothesis (Gg = 0) and local alternatives, and 
defer the derivations for global alternatives to the supplementary 
material. 

Under the null hypothesis of no association, cov(Zg, Zh) = 
E(TgTh). Using Equations (2) and (4) we have that 



(6) 



E {E 



= E 



E (T G T H ) = E I (xi - x) (gi -lt)x^ (xi - x) (hi - £j 

I i > 

(xi - x) (gi - ft) x (xi - x) (hi - |j | g 
_ i i 

i 

xj^ixi- x) (pna - [pno + n x (pm - pno)]) 

i 

The last line follows from 

E{(hi - k) i g,} = (i - i)p% - i(i -p%) =p% - i (7) 

Assuming that the participants are unrelated and that ft = it, the 
above expectation simplifies to 



E {T G T H ) = E j (.* ~ *) 2 (Si ~ 



x {(P% - [Piio + w x (pi|i -Piio)])} 
= ^](x J -x) 2 £{( & -7t)p 1 | s } 

i 

= mc(l - x)E (g,pi\ gi ) - nE (p 1]gi ) 
= nx(l -x) i -pi|o) n(l - Tt), 



(8) 



which is equal to zero if pi|o = pi|i = p (no linkage between G 
and H), and equal to nx(l — x)n(l — n) if pi|o = 0 andpi|i = 1 
(perfect linkage). If n and % were known then the denominators 



Dq and Dh are constants, and thus 

nx(l -x) (pi|i -pi| 0 ) Tt(l - jt) 



cor(Z G , Z H ) = 



nx(l - xOVrcO - Tt)^(l - |) 



^ n n ^ f 71(1 ~ ) V2 



cor(G, H). 



(9) 



Thus, under the null hypothesis and local alternatives, subject to 
the approximation that fc and % are constants, the correlation 
between the test statistics at two markers equals the correlation 
between the marker genotypes. If there is no correlation between 
G and H (linkage equilibrium) and H is not independently causal, 
then H is not associated with disease status. If H is in LD with G, 
an association with the disease status is induced by G. 

For a local alternative (Gg = 0(1/a/m)) a first-order Taylor 
series approximation yields 



Z G % N(A G , 1) with 

A G = Q G {nx(l -3c)fc(l - fc)} 1/2 



(10) 



If G is "causal" for the trait of interest but H is not, then the cor- 
relation between G and H induces a non-zero 0h in the score test. 
Specifically, 

Z H % N (A H , 1) with 

A H = e H )nx(i-x)^i-i)) 1/2 

= e G x ( Pm -p m ){nx(l-xMl-i)} l/2 , (11) 

where the last line follows from ^ =p 1 | 0 + fc x (pi|i — pi|o). 
Note that Ah is induced, so that in case of linkage equilibrium 
(pi|i = pi|o) we have Ah = 0. Also note that Qh depends on both 
the odds ratio at the causal marker (Gg) and the covariation of the 
genotypes. 

MULTI-MARKER TESTS FOR ASSOCIATION 

Let Z = (Zi , . . . , Zk ) be the score test Z statistics from an LD 
block with K markers. 

The maximum z-statistic Z max 

We define the maximum z-statistic as Z max = maxi</t<jc {|Zj;|}. 
The null distribution of Z max depends on the correlation matrix 
R of the test statistics Z, and for large samples we have Z ~ 
Nk(0, R). The two-sided p-value for Z max can be derived from 
this multivariate distribution by calculating 



Pmax = 2 x {1 - 4> R (Z max o 1 K )} 



(12) 



where <J>r is the cumulative distribution function of the multi- 
variate normal distribution with mean vector 0 and correlation 
matrix R, Ik is a vector of ones of length K, and the symbol o 
denotes the dot product. 
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The Bonferroni corrected p -value 

We compute the Bonferroni p-value in a set of K markers as K 
times the p-value stemming from the most significant marker as 
given by Z max = max^^ {\Z k \}. 

The optimal linear combination Z op t 

We consider a block of correlated markers as the region of inter- 
est and assume that one of these SNPs is biologically associated 
with the trait of interest. The statistical associations at neighbor- 
ing markers are thus controlled by the strength of the correlation 
between the causal marker and other loci in the analysis. With 
locus-specific Z-scores being approximately normally distributed, 
a linear combination (L'Z) is optimal. Generalizing from the 
two-locus case to a block of K SNPs with m = pr(Gj; = 1) for 
k € {1, . . . , K}, we define 



A k = E(Z k ) = Q k B k 

B k = {nx(l -x)jtfc(l - ftjt)} 1 2 



(13) 



The values B k aie known and depend on the minor allele fre- 
quency, but in general the Q k are unknown. The optimal linear 
combination depends on the relative sizes of the A k and so an 
assumption on the relative sizes of the Q k is needed, and certain 
cases are discussed below and in the next section. We let Z denote 
the vector of the Z k and A denote the vector of the A k . 

We need to identify the JC-dimensional vector L opt that 
maximizes the non-centrality {£(L'Z)} 2 = L'AA'L subject to 
L'RL = 1, with the correlation matrix R computed from the 
genotype correlation structure. This is equivalent to finding 
the L that maximizes L'HL subject to L'L = 1, where L = 
R _1 / 2 L and H = R -1 / 2 A A'R -1 / 2 . A standard matrix theory 
result (which can formally be derived using Lagrange multipli- 
ers) yields that L is the normalized first principal component 
loading vector of H, and we have L opt = R _1 / 2 L so that Z opt = 
T' 7 

Note that L opt depends only on the relative sizes of the As. 
If Ajt = A, then L opt = R^l/CRI^) 1 / 2 where R++ is the sum 
of all entries of R 1 . Further, in the case that all minor allele 
frequencies in the block are the same, then all B k are identical 
and A is the product of a constant and the row (or column) 
of R corresponding to the locus which is biologically associated 
with the trait of interest. (This follows from an extension of 
Equations (10) and (11) to more than two markers.) In this case, 
H has a degenerate form with all but one entry (the entry on 
the diagonal position corresponding to that of the causal locus) 
equal to 0. This yields that L opt is simply a vector equal to 1 at 
the causal locus, and zero otherwise, i.e., Z opt = Z causa \ for sets 
of markers with equal minor allele frequency. Thus, even though 
the associations in the remaining markers of the block are only 
induced by the causal SNP, there is in general more information 
in the optimal linear combination of score statistics than in the 
statistic from the causal locus alone, since the minor allele fre- 
quencies across a set of markers are virtually always non-identical, 
unless the markers are in perfect LD (and thus, every marker con- 
tains the same information about statistical association with the 
phenotype). 



We also note that the expectation of the optimal linear com- 
bination is £(Z opt ) = L opt A = (A'R -1 A) 1 / 2 , and thus the non- 
centrality is 



{£(Z op t)} 2 = ^L' opt A^ 2 = (A'R _1 A) . 



(14) 



If K is very large, care is needed in computing R _1 . However, in 
most situations either R will be relatively small (limited to the size 
of an LD block) or will have considerable structure with many 
zeros and non-communicating subsets and so only matrices of 
small to medium size will need to be inverted. 

The "agnostic " linear combination Z eq 

As in the case of the optimal linear combination Z opt above we 
consider a linear combination of z-scores, albeit without any prior 
knowledge of the location of any causal variant. This lack of 
knowledge comes into play in choosing a value of A, and we 
assume a uniform prior over the set of possible causal variants. 
In this case, the expected non-centrality is 



{£(L'Z)} 2 = L' 



±£a«[a<»]' 



where A® is the vector of expected values of the test statistics, 
assuming that marker k is the causal marker. More specifically, 
A* ' is the k th column of R that has been component-wise mul- 
tiplied by the B k s. In this case, we proceed as described above to 
find L op t, with our matrix H given by 



H = R 



•1/2 



i£)A«[A«]' 



-1/2 



This approach maximizes the pre-posterior expected non- 
centrality, although this does not guarantee better performance 
than Z max . 

The sequence kernel association test (SKAT) 

For comparison with the above methods, we also include the 
sequence kernel association test (SKAT), a widely used method 
for SNP-set analysis based on a logistic kernel-machine approach 
that allows for flexible, covariate adjusted relations between a 
genotype and the outcome of interest (Wu et al., 2010). Analyses 
were carried out using the publicly available SKAT R pack- 
age (http://cran.r-project.org/web/packages/SKAT) with default 
settings that produce a linearly weighted kernel, with weights 
inversely proportional to minor allele frequency. 

RESULTS 

SIMULATIONS BASED ON ASSUMED LD AND ALLELE FREQUENCIES 

We simulated a "naive" population under a dominant disease 
model using the R package bindata (http://cran.r-project. 
org/web/packages/bindata). We simulated 5-locus haplotype 
blocks with exchangeable (compound symmetry, CS) and auto- 
regressive lag-1 (AR1) correlation structures, with correlations 
between 0 and 0.8. For all markers in the haplotype blocks we 
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chose constant minor allele frequencies in this simulation, set 
at either 5 or 25%. One causal marker was selected and haplo- 
types were sampled to generate cases and controls as given by the 
genetic risk model, using a variety of odds ratios (1, 1.1, 1.4, and 
1.7). We generated 50,000 samples of 1000 cases and 1000 con- 
trols, and carried out marker-specific score tests to generate sets 
of test statistics. 

We investigated the type I error and power for the approach 
using the maximum z statistic, the optimal linear combination 
of the test statistics with known causal locus (comb opt ), and the 
"agnostic" linear combination of the test statistics assuming equal 
prior probabilities for each marker in the block to be causal 
(comb eq ). In addition, to mimic some limited biological infor- 
mation available, we show results for a linear combination of test 
statistics assuming equal prior probabilities for the causal and one 
additional marker, narrowing the set of potentially causal mark- 
ers to two out of five (combpair). For the compound symmetry 
simulations each pair of markers that contains the causal one is 
equivalent. For the auto-regressive lag-1 simulations, we show a 
pair of markers with correlation p and a pair of markers with cor- 
relation p 2 . Estimates of type I error and power are the fraction 
of simulations with p-values lower than the set significance level 
assuming two-sided tests. We also include the results derived for 
the Bonferroni correction with the significance level divided by 
the number of markers assessed. In addition to the typical sig- 
nificance level a = 0.05, we also assessed the different methods 
using a much stricter significance level for type I error con- 
trol, as is usually done in GWAs. These extreme tail probabilities 
were estimated using importance sampling (see supplementary 
material). 

With the exception of the conservative Bonferroni correc- 
tion all methods were well calibrated under the null hypoth- 
esis, for both types of correlations (compound symmetry and 
auto-regressive) and both minor allele frequencies considered 
(Figure 1). For much stronger type-I error control however all 
approaches can be slightly conservative, in particular in settings 
with auto-regressive correlation structure (see supplementary 
material). 

As expected, the optimal linear combination with correctly 
specified causal locus outperforms all other methods, yielding the 
largest power for odds-ratios of 1.1, 1.4, and 1.7. For this method, 
the estimated power was virtually constant for all magnitudes 
of correlations across markers within a block, for both simu- 
lated compound symmetry at low (Figure 2) and high (Figure 3) 
minor allele frequencies, as well as auto-regressive correlation 
structures (Figures 4, 5, respectively). Also as expected, the rela- 
tive loss of power for the other methods is worst for uncorrelated 
markers, and decreases with increasing correlation. For perfectly 
correlated markers, all methods except Bonferroni are equivalent. 
The data-driven maximum z-statistic which does not require any 
biological knowledge or other input generally performs better 
than combeq and Bonferroni, and thus, is a practical and more 
powerful method than conventionally employed approaches. 

Not surprisingly, the relative power of the Bonferroni method 
is particularly poor for highly correlated markers with com- 
pound symmetry when the genetic signal is weak (Figures 2, 3, 
left column). If the prior information about the locus for the 



causal variant is not very strong, little if any improvement can 
be achieved compared to the maximum z-statistic method. In the 
hypothetical case when the causal locus can be narrowed down 
to one of two loci in the LD block, comb pa i r occasionally yields 
modestly higher power than the maximum z-statistic method, but 
this is typically only the case when the two markers are in strong 
LD. However, for large effect sizes and weak correlations in the 
LD block, combp a i r performs at times even worse than multiple 
comparison correction via Bonferroni, and particularly notice- 
able at low minor allele frequencies (Figures 2, 4, right columns). 
As expected, comb pa i r with p always yields higher power than 
combp a ; r with p 2 in the auto-regressive setting (Figures 4, 5). 
The performance of comb eq is arguably the worst, and particularly 
poor in settings with high minor allele frequencies and strong 
signal (Figures 3, 5, right columns). 

SIMULATIONS BASED ON GENOMIC DATA 

As realistic examples of LD and minor allele frequencies, we also 
simulated data based on haplotypes in two regions of chromo- 
some 10 and one region on chromosome 22 delineated from the 
genome scans (Illumina Human660W-Quad BeadChip array) of 
4251 European American participants in the Lung Health Study, 
a NHLBI-supported multi-center randomized clinical trial in the 
United States and Canada to determine whether or not a program 
of smoking intervention and use of an inhaled bronchodilator 
could slow the rate of decline in pulmonary function in smokers 
with mild airflow limitation (Kanner et al, 1999). 



Compound Symmetry Autoregressive Lag 1 




0.00 0.Z0 0.35 0.50 0.65 0.80 0.95 0.00 0.20 0.35 0.50 0.65 0.80 



correlation p 

FIGURE 1 | Calibration under the null hypothesis for compound 
symmetry (left) and auto-regressive (right) correlation structures, 
assuming equal minor allele frequencies of 5% (top) and 25% (bottom) 
in a block of 5 markers. The fraction of rejected null hypotheses of no 
association (estimated type I error, y-axis) simulated in 50,000 data sets 
each is shown as a function of the between-marker correlation (x-axis). Each 
line represents a different method (see the following figures for details), 
with the Bonferroni method (red) showing the sharp decline as p increases. 
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OR = 1.1 OR = 1.4 OR = 1.7 




0.00 0.20 0.35 0.50 0.65 0.80 0.95 0.00 0.20 0.35 0.50 0.65 0.80 0.95 0.00 0.20 0.35 0.50 0.65 0.80 0.95 





020 0.35 0.50 065 0.80 0.95 




correlation p 




020 035 0 50 065 0 60 0 95 



FIGURE 2 | Comparing analytic strategies in the setting with 
compound symmetry correlation structures, assuming equal minor 
allele frequencies of 5% in a block of 5 markers. The fraction of 
rejected null hypotheses of no association (power, y-axis) in 50,000 



simulated data sets is shown as a function of the between-marker 
correlation (x-axis) for assumed odds-ratios of 1.1 (left), 1.4 (middle), 
and 1.7 (right). Each row highlights a different method, as labeled along 
the right-hand side. 



On chromosome 10 we chose two smaller blocks of 8 and 7 
markers respectively (top of Table 1 with lower minor allele fre- 
quencies and weaker LD; top of Table 2 with larger minor allele 
frequencies and stronger LD). On chromosome 22 we chose a 
larger block of 24 markers (mean .R 2 of 0.35, minor allele frequen- 
cies between 0.07 and 0.41, median of 0.28, see supplementary 
material). This block is part of a previously identified region of 
strong LD observed in a Caucasian population (Dawson et al., 
2002). We used inferred haplotypes in these regions to simulate 
case-control data sets with varying degrees of association between 
a hypothetical causal locus and the phenotype. For each block and 



a set of five different effect sizes (odds-ratios of 1, 1.1, 1.25, 1.4, 
and 1.7) we generated 50,000 samples of 1000 cases and 1000 con- 
trols, and carried out marker-specific score tests to generate sets 
of test statistics. 

In the eight marker block with the lower minor allele fre- 
quencies and weaker LD, we observed that all methods were 
well-calibrated under the null hypothesis (odds ratio equal to 
1), with the exception of the conservative Bonferroni correc- 
tion. The optimal linear combination with correctly specified 
causal locus again outperformed all other methods, yielding the 
largest power for odds-ratios of 1.1, 1.25, 1.4, and 1.7 (bottom 
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FIGURE 3 | Comparing analytic strategies in the setting with 
compound symmetry correlation structures, assuming equal minor 
allele frequencies of 25% in a block of 5 markers. The fraction of 
rejected null hypotheses of no association (power, y-axis) in 50,000 



simulated data sets is shown as a function of the between-marker 
correlation (x-axis) for assumed odds-ratios of 1.1 (left), 1.4 (middle), 
and 1.7 (right). Each row highlights a different method, as labeled along 
the right-hand side. 



of Table 1). The data-driven maximum z-statistic which does 
not require biological knowledge or other input showed a slight 
loss in power compared to the optimal method, however per- 
formed substantially better than any of the other approaches 
considered, including the hypothetical cases where the causal 
locus could be narrowed down to one of two loci (comb pa i r 
in Table 1), even when those two markers were in somewhat 
strong LD (correlation of 0.68 between marker 4 and the 
causal locus 5). In this example, the "paired" approach per- 
formed even worse than multiple comparison correction via 
Bonferroni. 



For the seven marker block with the larger minor allele 
frequencies and stronger LD we observed similar results (bot- 
tom of Table 2). However, the hypothetical case where the causal 
locus could be narrowed down to one of two loci yielded 
higher power when the two markers were in strong LD (cor- 
relation of 0.81 between marker 3 and the causal locus 4). 
Interestingly, the sequence kernel association test showed very dif- 
ferent performances for these two simulation scenarios. While 
properly calibrated under the null, the power for the simula- 
tion on the block of eight markers with overall lower minor 
allele frequencies and weaker LD was substantially lower than 
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FIGURE 4 | Comparing analytic strategies in the setting with 
autoregressive correlation structures, assuming equal minor allele 
frequencies of 5% in a block of 5 markers. The fraction of 
rejected null hypotheses of no association (power, y-axis) in 50,000 



simulated data sets is shown as a function of the between-marker 
correlation (x-axis) for assumed odds-ratios of 1.1 (left), 1.4 (middle), 
and 1.7 (right). Each row highlights a different method, as labeled 
along the right-hand side. 



the power of most of the other methods (Table 1). On the 
other hand, for the simulation on the block of seven markers 
with larger minor allele frequencies and stronger LD, the per- 
formance was very competitive. One possible explanation for 



this behavior is the weighting scheme in SKAT - by default, 
low frequency variants carry higher weights than common vari- 
ants. In the first setting, the marker with the lowest minor 
allele frequency (i.e., the highest weight) has only a very 
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FIGURE 5 | Comparing analytic strategies in the setting with 
autoregressive correlation structures, assuming equal minor allele 
frequencies of 25% in a block of 5 markers. The fraction of 
rejected null hypotheses of no association (power, y-axis) in 50,000 



simulated data sets is shown as a function of the between-marker 
correlation (x-axis) for assumed odds-ratios of 1.1 (left), 1.4 (middle), 
and 1.7 (right). Each row highlights a different method, as labeled 
along the right-hand side. 



weak correlation to the causal, much more variable SNP (p = 
—0.14), while in the second setting all markers have about 
the same minor allele frequency, and are much more strongly 
correlated. 



The simulation on the 24 marker block yields similar results 
in general, but some qualitative differences are noteworthy. The 
optimal linear combination again performs best overall, although 
for odds ratios of 1.4 the maximum z-statistic shows slightly 
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Table 1A | Linkage disequilibrium between SNPs measured by 
Pearson's correlation coefficient along with the SNP minor allele 
frequencies in an eight marker LD block on chromosome 10, 
simulated based on genome scans of samples from the Lung Health 
Study. 
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Table 1B | Estimated type-l errors (for OR = 1.00) and power 

(OR = 1.10, 1.25, 1.40 and 1.70) for different methods of addressing 

multiple comparisons in the eight marker LD block. 



Odds ratio 


1.00 


1.10 


1.25 


1.40 


1.70 


Bonferroni 


0.035 


0.099 


0.520 


0.904 


1.000 


Maximum Z 


0.050 


0.128 


0.576 


0.925 


1.000 


comb 0 pt 


0.051 


0.184 


0.688 


0.958 


1.000 


COmbeq 


0.049 


0.090 


0.266 


0.515 


0.869 


comb pa j r (markers 5 and 4) 


0.051 


0.121 


0.442 


0.774 


0.986 


combp a i r (markers 5 and 7) 


0.050 


0.061 


0.114 


0.192 


0.328 


SKAT 


0.048 


0.058 


0.118 


0.282 


0.799 



Marker 5 was assumed to be the causal locus. 



higher power (Table 3), likely due to the somewhat fragmented 
LD across the 24 markers observed in our population (see sup- 
plementary figures). As before, the "paired" approach performs 
well when the two markers are in strong LD (the causal markers 
12 and marker 13 have an R 2 of 0.97), and yields unsatisfactory 
power when the markers are not in strong LD (R 2 of 0.05 between 
markers 12 and 3). The performance of SKAT again is affected 
by the distribution of minor allele frequencies in the block and 
the observed LD. While the causal marker 12 has an appreciable 
minor allele frequency of 0.28, some other markers show much 
less variation, and thus receive more weight in the default settings. 
Here, the lowest minor allele frequency (MAF = 0.07) is observed 
for marker 3, which is in very low LD with the causal marker (.R 2 
of 0.05). Overall, and similar to previous results, the largest gain 
of power for the optimal linear combination relative to the other 
methods is seen at lower odds ratios. 

DISCUSSION 

We evaluated three approaches to controlling multiplicity in 
GWAs: standard Bonferroni, the correlation-calibrated maximum 
statistic, and a theoretical benchmark: the optimal linear combi- 
nation of locus specific test statistics which requires knowledge 
of the causal locus. Computation of the latter two depends on 
the correlation among the test statistics; the performance of each 



Table 2A | Linkage disequilibrium between SNPs measured by 
Pearson's correlation coefficient along with the SNP minor allele 
frequencies in a seven marker LD block on chromosome 10, simulated 
based on genome scans of samples from the Lung Health Study. 
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Table 2B | Estimated type-l errors (for Off = 


1.00) and power (Off = 


1.10, 1.25, 1.40 and 1.70) for different methods of addressing multiple 


comparisons in the seven marker LD block. 








OR 1.00 


1.10 


1.25 


1.40 


1.70 


Bonferroni 0.023 


0.184 


0.843 


0.997 


1.000 


Maximum Z 0.036 


0.234 


0.879 


0.998 


1.000 


combopt 0.051 


0.321 


0.937 


1.000 


1.000 


COmbeq 0.051 


0.288 
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1.000 


combpair (markers 4 and 3) 0.051 


0.293 


0.910 


0.999 


1.000 


combpair (markers 4 and 6) 0.051 


0.299 


0.916 


0.999 


1.000 


SKAT 0.051 


0.298 


0.920 


0.999 


1.000 



Marker 4 was assumed to be the causal locus. 



Table 3 | Estimated type-l errors (for OR = 1.00) and power (Off = 
1.10, 1.25, 1.40 and 1.70) for different methods of addressing multiple 
comparisons in a 24 marker block on chromosome 22, simulated 
based on genome scans of samples from the Lung Health Study 
(correlation and LD structure shown in the supplementary materials). 



OR 




1.00 


1.10 


1.25 


1.40 


1.70 


Bonferroni 




0.026 


0.107 


0.634 


0.967 


1.000 


Maximum Z 




0.047 


0.159 


0.717 


0.979 


1.000 


combopt 




0.051 


0.199 


0.741 


0.974 


1.000 


COmbeq 




0.050 


0.137 


0.528 


0.860 


0.996 


combpair (markers 


12 and 13) 


0.050 


0.192 


0.721 


0.968 


1.000 


combpair (markers 


12 and 3) 


0.048 


0.072 


0.187 


0.361 


0.689 


SKAT 




0.050 


0.063 


0.136 


0.283 


0.710 



Minor allele frequencies in the region ranged from 0.07 to 0.41, and marker 12 
(MAF 0.28) was assumed to be the causal locus. Marker 13 has a MAF of 0.28 
and R 2 of 0.96 with marker 12, marker 3 has a MAF of 0.07 and R 2 of 0.04 with 
marker 12. 



depends on this correlation. We reiterate that the correlation 
among the test statistics is essentially identical to the biologi- 
cal correlation amongst the genotypes (the LD structure) and 
this can be estimated. For an additional comparison to the 
above methods, we included the sequence kernel association 
test (SKAT), a widely used method for SNP-set analysis based 
on a logistic kernel-machine approach that allows for flexible, 
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covariate adjusted relations between a genotype and the outcome 
of interest. 

Simulations show that the two correlation-dependent 
approaches are well-calibrated under the null hypothesis. As 
expected, unless the correlations are very small, the Bonferroni 
approach is conservative. In the context of the test Z-scores being 
well approximated by a multivariate normal distribution, the 
optimal linear combination dominates all other approaches, but 
this optimality is quite fragile, depending on having identified the 
causal locus or one in high LD with it. If the causal locus is poorly 
selected, our linear combination using "best guess" weights as 
one example, the properly-calibrated Max statistic often performs 
better, sometimes by a substantial amount with these relations 
depending on the magnitude of the correlations, their pattern 
(compound symmetry or auto-regressive), and the magnitude of 
the genotype-phenotype association. We do see gains in power 
using a linear combination where we have narrowed down the 
set of candidate loci in our block, particularly in the case of 
very small effect sizes. The haplotype-based simulations produce 
similar comparisons, but with generally smaller differences 
amongst the approaches. 

Overall, the calibrated maximum method is very effective at 
maintaining power compared to use of a linear combination. 
However, when the causal locus is correctly specified, the optimal 
linear combination can confer a considerable increase in power. 
Therefore, there is some room for error and we have also provided 
an approach by which it is possible to specify prior probabilities 
on the loci and then use the induced, optimal linear combination. 
As we have shown, if there are two markers in a block that have 
a higher prior likelihood of being associated with disease (e.g., 
due to damaging functional prediction), putting higher weights, 
or all weight, on these will provide robustness to mis-specification 
of the causal locus, while providing more power than the Max 
test in some cases, especially for very low effect sizes. However, 
our equally weighted case is equivalent to giving each locus equal 
prior probability and its generally poor performance indicates 
that some focus is needed. 

We also found that the sequence kernel association test (SKAT) 
run with its default values is a very competitive method in settings 
when LD within a block is strong — which also implies similar 
minor allele frequencies between markers, as high R 2 values are 
mathematically not possible between SNPs of very different allele 
frequencies. On the other hand, the power in the simulations 
with lower minor allele frequencies and weaker LD was lower 
for SKAT than the power of both the maximum and the opti- 
mal linear combination tests. We conjecture that this is due to the 
default weighting scheme in SKAT — up-weighting less common 
variants — while in our simulation the marker with the lowest 
minor allele frequency and thus the highest weight had only a very 
weak correlation to the assumed causal marker. Thus, we believe 
that the default SKAT is most useful for blocks with high LD, 
and for association tests under the common assumption of higher 
penetrance for lower allele frequency variants. We also note that 
SKAT allows for weighted burden tests, which we did not consider 
in this manuscript. 

One challenge for all methods of this type is the dependence 
on having pre-defined blocks of interest. There are several existing 



methods for estimating LD-structures (e.g., Stephens et al., 2001; 
Gabriel et al., 2002; Browning and Browning, 2009) which can 
be used to identify LD-blocks. Here, we have estimated the LD 
by computing correlation values of the encoded genotypes using 
the data set at hand, rather than external databases, which avoids 
incorporating mis-specified structure due to differences in sample 
populations compared to an external reference. This is in contrast 
to the often recommended usage of external data, and it will be 
informative to investigate in detail the impact of ambiguous LD 
blocks (such as the 24 marker block from chromosome 22) for 
any of the considered methods. 
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